6 High Dimensional Linear Regression

Continuing from the last model, we consider the model obtained by plugging in all possible $c_i$'s:
$$y_t=\beta_0+\beta_1(t-1)+\beta_2\,\mathrm{ReLU}(t-2)+\cdots+\beta_{n-1}\,\mathrm{ReLU}(t-(n-1))+\varepsilon_t. \tag{1}$$
Note that it differs from our last model:
$$y_t=\beta_0+\beta_1(t-1)+\beta_2\,\mathrm{ReLU}(t-c_1)+\cdots+\beta_{k+1}\,\mathrm{ReLU}(t-c_k)+\varepsilon_t. \tag{2}$$
The new model has no knot parameters $c_k$, so it is linear in its parameters (as discussed earlier). Moreover, (2) is used with a small $k$, while (1) can have a large $n$. In short, (1) is a high-dimensional linear regression model, while (2) is a low-dimensional nonlinear regression model.

Since (1) has ample parameters, it is a flexible model.
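To make the structure concrete, here is a minimal numpy sketch that builds the design matrix of model (1); the helper name `relu_design` is ours, not from the notes:

```python
import numpy as np

def relu_design(n):
    """Design matrix of model (1): columns 1, (t-1), ReLU(t-2), ..., ReLU(t-(n-1))."""
    t = np.arange(1, n + 1)                  # t = 1, ..., n
    cols = [np.ones(n), t - 1.0]             # intercept and linear term
    for j in range(2, n):                    # one ReLU column per kink location j
        cols.append(np.maximum(t - j, 0.0))
    return np.column_stack(cols)             # shape (n, n): as many columns as data points

X = relu_design(5)
print(X.shape)  # (5, 5)
```

Note that the matrix is square: the number of parameters equals the number of observations, which is exactly what makes the model high-dimensional.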

1 Parameter Interpretation in (1)

Let $\mu_t$ denote the deterministic part of (1) (without $\varepsilon_t$):
$$\mu_t=\beta_0+\beta_1(t-1)+\beta_2\,\mathrm{ReLU}(t-2)+\cdots+\beta_{n-1}\,\mathrm{ReLU}(t-(n-1)).$$
Then (1) can be rewritten as
$$y_t=\mu_t+\varepsilon_t,\qquad \varepsilon_t \overset{\text{i.i.d.}}{\sim} N(0,\sigma^2).$$

The parameters can then be interpreted in terms of $\mu_t$.

Plugging in $t=1$ gives $\beta_0=\mu_1$. Plugging in $t=2$ gives $\beta_1=\mu_2-\mu_1$. Similarly, for $t=2,\dots,n-1$,
$$\beta_t=(\mu_{t+1}-\mu_t)-(\mu_t-\mu_{t-1}).$$
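These identities are easy to verify numerically. A small numpy sketch with toy coefficients (our own setup):

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
beta = rng.normal(size=n)                    # toy beta_0, ..., beta_{n-1}

# mu_t = beta_0 + beta_1 (t-1) + sum_{j>=2} beta_j ReLU(t-j), for t = 1, ..., n
t = np.arange(1, n + 1)
mu = beta[0] + beta[1] * (t - 1.0)
for j in range(2, n):
    mu += beta[j] * np.maximum(t - j, 0.0)

assert np.isclose(beta[0], mu[0])            # beta_0 = mu_1
assert np.isclose(beta[1], mu[1] - mu[0])    # beta_1 = mu_2 - mu_1
for s in range(2, n):                        # beta_t = (mu_{t+1}-mu_t) - (mu_t-mu_{t-1})
    assert np.isclose(beta[s], (mu[s] - mu[s - 1]) - (mu[s - 1] - mu[s - 2]))
print("identities verified")
```

(Indices are 0-based in the code, so `mu[s]` is $\mu_{s+1}$.)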

If we use this model to represent the logarithm of a population, then $\beta_0$ is the initial log-level, $\beta_1$ is the growth rate, and each $\beta_t$ ($t\ge 2$) is a change in the growth rate. So the parameters have different scales.

2 Parameter Estimation

2.1 Unregularized MLE

For the linear regression model (1), we can estimate the parameters as usual by MLE, i.e., by minimizing
$$\sum_{t=1}^n\bigl(y_t-\beta_0-\beta_1(t-1)-\beta_2\,\mathrm{ReLU}(t-2)-\cdots-\beta_{n-1}\,\mathrm{ReLU}(t-(n-1))\bigr)^2.$$
We then have $\hat\sigma^2_{\mathrm{MLE}}=\mathrm{RSS}/n$.
However, the number of data points now equals the number of coefficients ($n$ observations, $n$ parameters $\beta_0,\dots,\beta_{n-1}$), so the fit is exact: $\mathrm{RSS}=0$ and $\hat\sigma=0$. The unbiased estimate $\hat\sigma^2=\mathrm{RSS}/(n-p)$ does not exist because $n=p$.
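A quick numpy sketch illustrating this degeneracy: with $p=n$ the design matrix of model (1) is square and invertible, so least squares interpolates the data exactly (toy data, our own construction):

```python
import numpy as np

n = 6
rng = np.random.default_rng(1)
y = rng.normal(size=n)                       # arbitrary toy data

# Square design matrix of model (1): columns 1, (t-1), ReLU(t-2), ..., ReLU(t-(n-1))
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t - 1.0] +
                    [np.maximum(t - j, 0.0) for j in range(2, n)])

beta_hat = np.linalg.solve(X, y)             # exact solve: X is square and invertible
rss = np.sum((y - X @ beta_hat) ** 2)
print(rss)                                   # essentially 0: the model interpolates the data
```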

So traditional estimates overfit the data.

2.2 Regularization

Now we add regularization. The ridge estimator $\hat\beta^{\mathrm{ridge}}(\lambda)$ is the minimizer of
$$\sum_{t=1}^n\Bigl(y_t-\beta_0-\beta_1(t-1)-\sum_{j=2}^{n-1}\beta_j\,\mathrm{ReLU}(t-j)\Bigr)^2+\lambda\sum_{j=2}^{n-1}\beta_j^2.$$
We also have the LASSO estimator $\hat\beta^{\mathrm{lasso}}(\lambda)$, the minimizer of
$$\sum_{t=1}^n\Bigl(y_t-\beta_0-\beta_1(t-1)-\sum_{j=2}^{n-1}\beta_j\,\mathrm{ReLU}(t-j)\Bigr)^2+\lambda\sum_{j=2}^{n-1}|\beta_j|.$$

Correspondingly, plugging in $\mu$, ridge becomes the Hodrick-Prescott filter:
$$\sum_{t=1}^n(y_t-\mu_t)^2+\lambda\sum_{t=2}^{n-1}\bigl[(\mu_{t+1}-\mu_t)-(\mu_t-\mu_{t-1})\bigr]^2,$$
and LASSO becomes the $\ell_1$-trend filter:
$$\sum_{t=1}^n(y_t-\mu_t)^2+\lambda\sum_{t=2}^{n-1}\bigl|(\mu_{t+1}-\mu_t)-(\mu_t-\mu_{t-1})\bigr|.$$
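This correspondence rests on the identity $\beta_t=(\mu_{t+1}-\mu_t)-(\mu_t-\mu_{t-1})$, so the penalty on the coefficients equals the penalty on the second differences of the trend. A short numpy check (toy data, our own variable names):

```python
import numpy as np

n = 10
rng = np.random.default_rng(2)
t = np.arange(1, n + 1)
beta = rng.normal(size=n)                    # toy coefficients of model (1)

# Build mu from model (1)
mu = beta[0] + beta[1] * (t - 1.0)
for j in range(2, n):
    mu += beta[j] * np.maximum(t - j, 0.0)

# Second differences (mu_{t+1}-mu_t)-(mu_t-mu_{t-1}) for t = 2, ..., n-1
second_diff = mu[2:] - 2 * mu[1:-1] + mu[:-2]

# Ridge penalty on beta equals the HP-filter penalty on mu,
# and the LASSO penalty equals the L1-trend-filter penalty
assert np.isclose(np.sum(beta[2:] ** 2), np.sum(second_diff ** 2))
assert np.isclose(np.sum(np.abs(beta[2:])), np.sum(np.abs(second_diff)))
print("penalty equivalence verified")
```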

Fact (Simple Ridge)

For $f(\beta)=(y-\beta)^2+\lambda\beta^2$ with $\lambda>0$, the minimizer is easily found to be $\hat\beta=\dfrac{y}{1+\lambda}$.
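A quick numerical sanity check of this fact, via a numpy grid search with toy values of our own choosing:

```python
import numpy as np

y, lam = 2.0, 0.5
grid = np.linspace(-5, 5, 100001)            # fine grid of candidate beta values
f = (y - grid) ** 2 + lam * grid ** 2        # the simple ridge objective
numerical = grid[np.argmin(f)]               # grid minimizer
closed_form = y / (1 + lam)                  # the claimed minimizer y / (1 + lambda)
print(numerical, closed_form)
assert abs(numerical - closed_form) < 1e-3
```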

We can rewrite the ridge objective in matrix form: $\|y-X\beta\|^2+\lambda\sum_{j=2}^{n-1}\beta_j^2$. Taking $J=\mathrm{diag}(0,0,1,\dots,1)$, the gradient is
$$\nabla\Bigl(\|y-X\beta\|^2+\lambda\sum_{j=2}^{n-1}\beta_j^2\Bigr)=-2X^Ty+2X^TX\beta+2\lambda J\beta.$$
Setting it to $0$, we obtain
$$\hat\beta^{\mathrm{ridge}}(\lambda)=(X^TX+\lambda J)^{-1}X^Ty. \tag{2.1}$$
Compared with the ordinary least squares estimate $\hat\beta=(X^TX)^{-1}X^Ty$, the only difference is the $\lambda J$ term.
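As a sanity check of (2.1), the following numpy sketch (toy design matrix of model (1), our own construction) computes the closed-form ridge estimate and verifies that it satisfies the first-order condition:

```python
import numpy as np

n = 6
rng = np.random.default_rng(3)
t = np.arange(1, n + 1)
X = np.column_stack([np.ones(n), t - 1.0] +
                    [np.maximum(t - j, 0.0) for j in range(2, n)])
y = rng.normal(size=n)

lam = 1.0
J = np.diag([0.0, 0.0] + [1.0] * (n - 2))    # no penalty on beta_0, beta_1
beta_ridge = np.linalg.solve(X.T @ X + lam * J, X.T @ y)   # equation (2.1)

# First-order condition: -2 X^T y + 2 X^T X beta + 2 lambda J beta = 0
grad = -2 * X.T @ y + 2 * X.T @ X @ beta_ridge + 2 * lam * J @ beta_ridge
assert np.allclose(grad, 0.0)
print("first-order condition verified")
```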

For LASSO, consider $f(\beta)=(y-\beta)^2+\lambda|\beta|$.

Fact (Simple LASSO)

The minimizer of $f(\beta)=(y-\beta)^2+\lambda|\beta|$ is given by
$$\hat\beta=\begin{cases}y-\lambda/2,& y>\lambda/2,\\ y+\lambda/2,& y<-\lambda/2,\\ 0,& -\lambda/2\le y\le\lambda/2.\end{cases}$$
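A minimal Python sketch of this soft-thresholding rule (the function name `simple_lasso` is ours), checked against a grid search over the objective:

```python
import numpy as np

def simple_lasso(y, lam):
    """Minimizer of (y - b)^2 + lam * |b|: soft thresholding at lam / 2."""
    if y > lam / 2:
        return y - lam / 2
    if y < -lam / 2:
        return y + lam / 2
    return 0.0                               # |y| <= lam / 2 is shrunk exactly to zero

# Check against a grid search over the objective for several toy y values
lam = 1.0
grid = np.linspace(-5, 5, 100001)
for y in (-2.0, -0.3, 0.0, 0.3, 2.0):
    numerical = grid[np.argmin((y - grid) ** 2 + lam * np.abs(grid))]
    assert abs(simple_lasso(y, lam) - numerical) < 1e-3
print("soft thresholding verified")
```

Note the qualitative difference from ridge: small inputs are set exactly to zero, which is why LASSO produces sparse coefficients.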

3 Cross Validation for Selecting λ

We can use cross validation to select $\lambda$. First split the index set $T=\{1,\dots,n\}$ into $T_{\mathrm{train}}$ and $T_{\mathrm{test}}$ (say 80%/20%). For this split, fit the model on $T_{\mathrm{train}}$: obtain $\hat\beta^{\mathrm{ridge}}_{\mathrm{train}}(\lambda)$ as the minimizer of
$$\sum_{t\in T_{\mathrm{train}}}\bigl(y_t-\beta_0-\beta_1(t-1)-\beta_2\,\mathrm{ReLU}(t-2)-\cdots-\beta_{n-1}\,\mathrm{ReLU}(t-(n-1))\bigr)^2+\lambda\bigl(\beta_2^2+\cdots+\beta_{n-1}^2\bigr),$$
and $\hat\beta^{\mathrm{lasso}}_{\mathrm{train}}(\lambda)$ as the minimizer of
$$\sum_{t\in T_{\mathrm{train}}}\bigl(y_t-\beta_0-\beta_1(t-1)-\beta_2\,\mathrm{ReLU}(t-2)-\cdots-\beta_{n-1}\,\mathrm{ReLU}(t-(n-1))\bigr)^2+\lambda\bigl(|\beta_2|+\cdots+|\beta_{n-1}|\bigr).$$
Using these estimates, predict $y_t$ for $t\in T_{\mathrm{test}}$:
$$\hat y_t^{\mathrm{ridge}}(\lambda)=\hat\beta^{\mathrm{ridge}}_{\mathrm{train},0}(\lambda)+\hat\beta^{\mathrm{ridge}}_{\mathrm{train},1}(\lambda)(t-1)+\hat\beta^{\mathrm{ridge}}_{\mathrm{train},2}(\lambda)\,\mathrm{ReLU}(t-2)+\cdots+\hat\beta^{\mathrm{ridge}}_{\mathrm{train},n-1}(\lambda)\,\mathrm{ReLU}(t-(n-1)),$$
$$\hat y_t^{\mathrm{lasso}}(\lambda)=\hat\beta^{\mathrm{lasso}}_{\mathrm{train},0}(\lambda)+\hat\beta^{\mathrm{lasso}}_{\mathrm{train},1}(\lambda)(t-1)+\hat\beta^{\mathrm{lasso}}_{\mathrm{train},2}(\lambda)\,\mathrm{ReLU}(t-2)+\cdots+\hat\beta^{\mathrm{lasso}}_{\mathrm{train},n-1}(\lambda)\,\mathrm{ReLU}(t-(n-1)).$$
Denote
$$\text{Test-Error}^{\mathrm{ridge}}(\lambda)=\sum_{t\in T_{\mathrm{test}}}\bigl(y_t-\hat y_t^{\mathrm{ridge}}(\lambda)\bigr)^2,\qquad \text{Test-Error}^{\mathrm{lasso}}(\lambda)=\sum_{t\in T_{\mathrm{test}}}\bigl(y_t-\hat y_t^{\mathrm{lasso}}(\lambda)\bigr)^2.$$
Going over all splits, we obtain the total test error:
$$\text{AllSplit-Test-Error}^{\mathrm{ridge}}(\lambda)=\sum_{\text{all splits}}\text{Test-Error}^{\mathrm{ridge}}(\lambda),\qquad \text{AllSplit-Test-Error}^{\mathrm{lasso}}(\lambda)=\sum_{\text{all splits}}\text{Test-Error}^{\mathrm{lasso}}(\lambda).$$
We apply this procedure to a set of candidate $\lambda$ values and choose the $\lambda$ that minimizes the all-split test error.
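Putting the pieces together, here is a minimal numpy sketch of cross-validated selection of $\lambda$ for ridge; the toy data and candidate grid are our own, and the folds are the every-fifth-point splits mentioned in the notes (0-indexed here):

```python
import numpy as np

def relu_design(n):
    """Design matrix of model (1)."""
    t = np.arange(1, n + 1)
    return np.column_stack([np.ones(n), t - 1.0] +
                           [np.maximum(t - j, 0.0) for j in range(2, n)])

def ridge_fit(X, y, lam):
    """Closed-form ridge (2.1), leaving beta_0 and beta_1 unpenalized."""
    p = X.shape[1]
    J = np.diag([0.0, 0.0] + [1.0] * (p - 2))
    return np.linalg.solve(X.T @ X + lam * J, X.T @ y)

# Toy data: a piecewise-linear trend with one kink, plus noise
rng = np.random.default_rng(4)
n = 50
t = np.arange(1, n + 1)
y = 0.2 * t + 0.3 * np.maximum(t - 25, 0.0) + rng.normal(scale=0.5, size=n)
X = relu_design(n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]      # candidate values (toy grid)
errors = []
for lam in lambdas:
    total = 0.0
    for i in range(5):                       # fold i: test indices {5k + i}
        test = np.arange(i, n, 5)
        train = np.setdiff1d(np.arange(n), test)
        beta = ridge_fit(X[train], y[train], lam)
        total += np.sum((y[test] - X[test] @ beta) ** 2)
    errors.append(total)                     # all-split test error for this lambda

best_lam = lambdas[int(np.argmin(errors))]
print("selected lambda:", best_lam)
```

The LASSO version replaces `ridge_fit` with an $\ell_1$-penalized solver (no closed form; e.g., coordinate descent), but the cross-validation loop is identical.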

A common choice of test sets is $T_{\mathrm{test}}=\{5k+i : k\in\mathbb{N}_0\}\cap T$ for $i=1,2,3,4,5$, i.e., five folds each holding out every fifth time point.